Week 03: Parsing & Top-Down Parsing
What is parsing?
- Goal: To derive the parse tree of program in grammar.
- Input: Sequence of tokens from lexer
- Output: Parse tree of the program.
- Sample
- Input:
IF ID = ID THEN INT ELSE INT FI
- Output:
(IF-THEN-ELSE (= ID ID) (INT) (INT))
Context-Free Grammars
- Goal: To separate the token string into many pieces such that every piece is acceptable in grammar.
- Context-Free means every two pieces are not related.
- Parser must distinguish between valid and invalid strings of tokens under CFGs.
- What we need
- A language for describing valid strings of tokens.
- A method for distinguishing valid under the language.
- Note: Programming languages always have recursive structures.
- An EXPR is ... (recursive)
if EXPR then EXPR else EXPR fi
- Context-free grammars are a natural notation for this recursive structure.
- Def. A CFG consists of
- A set of terminals $T$
- A set of non-terminals $N$
- A start symbol $S \in N$
- A set of productions $X \rightarrow Y_1, \dots, Y_N, X\in N, Y_i \in N\cup T\cup \{ \epsilon\}$
- Ex: $S\rightarrow (S)$
- How the production works?
- Begin with a string with only the start symbol $S$
- Replace any non-terminal $X$ in the string by the right-hand side of some production $X \rightarrow Y_1\dots Y_n$
- Repeat (2) until there are no non-terminals
- $X_1\dots X_i X X_{i+1}\dots X_N \rightarrow X_1\dots X_i Y_1\dots Y_k X_{i+1}\dots X_N,\ if\ X\rightarrow Y_1\dots Y_k$
- $S \rightarrow \dots \rightarrow \alpha_0 \rightarrow \alpha_1 \rightarrow \dots \rightarrow \alpha_n, if\ \alpha_0 \rightarrow^* \alpha_n (\geq 0\ steps)$
- Def. (Language) Use grammar to define language.
- Let $G$ be a context-free grammar with start symbol $S$. Then the language $L(G)$ of $G$ is $$\{ a_1\dots a_n\ |\ \forall i\ (a_i\in T)\wedge (S\rightarrow^* a_1\dots a_n) \}$$
- Terminals ought to be tokens of the language.
- Membership in a language is yes or no; also need parse tree of the input.
- Must handle errors gracefully.
- Need an implementation of CFG’s (e.g., bison)
- Form of the grammar is important.
- Many grammars generate the same language.
- Tools are sensitive to the grammar.
Derivations
- Goal: To define the process from the start symbol to some valid token string.
- Def. A derivation is a sequence of productions applying on a start symbol $S$ to sequences without any non-terminal.
- We can draw a derivation as a tree
- Start symbol is the tree’s root
- For a production $X \rightarrow Y_1\dots Y_n$ add children $Y_1\dots Y_n$ to node $X$.
- Sample
- $G$: $E\rightarrow E+E\ |\ E*E\ |\ (E)\ |\ id$
- Target: $id*id+id$
- Derivation: $E \rightarrow E+E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
- Draw derivation as a parse tree: $(((id) * (id)) + (id))$
- The example is a left-most derivation. At each step, replace the left-most non-terminal.
- Note that right-most and left-most derivations have the same parse tree.
- A parse tree has
- Terminals at the leaves
- Non-terminals at the interior nodes
- An in-order traversal of the leaves is the original input.
- We are not just interested in whether $s \in L(G)$.
- We need a parse tree for $s$.
- The relation between derivation and parse tree are many to one.
Ambiguity
- Goal: A parse tree is used to describe how we explain the program string. If there are two parse trees for the program, it has two meanings.
- Sample
- $G$: $E\rightarrow E+E\ |\ E*E\ |\ (E)\ |\ id$
- Target: $id*id+id$
- This string has two parse trees
- Derivation: $E \rightarrow E+E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
- Derivation: $E \rightarrow E*E \rightarrow E*E+E \rightarrow id*E+E \rightarrow id*id+E \rightarrow id*id+id$
- Def. A grammar is ambiguous, if it has more than one parse tree for some string $s\in L(G)$.
- A lot of ways to solve it.
- Most direct method is to rewrite grammar unambiguously.
- Like, enforces precedence of * over +.
- Impossible to convert automatically an ambiguous grammar to an unambiguous one.
- Most tools allow precedence and associativity declarations to disambiguate grammars.
Error
Error
- Goal
- To detect non-valid programs
- To translate the valid ones
- Sample
- (Lexer) Errors in Lexical:
...$...
- (Parser) Errors in Syntax:
...x *%...
- (Type checker) Errors in Semantic:
...int x; y=x(3);...
- (Tester/User) Errors in Correctness: program
Error Handler
- Goal
- Report errors accurately and clearly
- Recover from an error quickly
- Not slow down compilation of valid code
- What to do
- Panic mode
- Discard tokens until one with a clear role is found. And continue.
- Bison: use the special terminal error to describe how much input to skip.
- Error Productions
- Specify known common mistakes in the grammar, like
5x
and 5*x
...
- Complicates the grammar.
- Automatic local or global correction
- Try token insertions and deletions. (Edit distance)
- Exhaustive search.
Abstract Syntax Trees (AST)
- Goal: Represent the parse tree and ignore the details.
- Form:
(op (op1) (op2),...,(opn))
(IF-THEN-ELSE (< (x) (5)) (= (y) (4)) (= (y) (3)))
Top-Down Parsing
- Recursive Descent Parsing
- The parse tree is constructed
- From the top
- From left to right
- Use DFS to match the token string
- Let the global pointer
next
point to the next input token: TOKEN *next = array;
- Try to match a given token terminal:
bool term(TOKEN tok) { return *next++ == tok; }
- Try to mathc the $n^{th}$ production of S:
bool Sn() { ... }
- For production $E\rightarrow T$:
bool E1() { return T(); }
- For production $E\rightarrow T+E$:
bool E2() { return T() && term(PLUS) && E(); }
- For all productions of E (with backtracking)
bool E() { TOKEN *save = next; return (next = save, E1()) || (next = save, E2()); }
- Error in Recursive Descent
- Scenario: $S \rightarrow Sa$
- It goes into an infinite loop.
- RD does not work in left-recursive grammar.
- Def. A left-recursive grammar has a non-terminal $S$, $S\rightarrow^+ S\alpha$, for some $\alpha$
- Rewrite all production rules to use right-recursion.